{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Use Azure Machine Learning to track datasheets for models\n",
    "\n",
    "Learn how to implement datasheets for machine learning assets using the Azure Machine Learning Python SDK.\n",
    "\n",
    "## What are datasheets\n",
    "\n",
    "Datasheets are a way to document machine learning assets that are used and created as part of the machine learning lifecycle. For example in the context of datasets, it can provide information such as:\n",
    "\n",
    "- How and why a dataset was created\n",
    "- How the dataset should and shouldn't be used\n",
    "- Highlight potential ethical and legal considerations\n",
    "\n",
    "The concept is inspired by the electronics industry. It is standard for electronic components to include datasheets with information such as operating characteristics, recommended usage, and other information.\n",
    "\n",
    "In addition to datasets, datasheets can be used other machine learning assets used and created as part of the machine learning lifecycle. Models are such an asset where datasheets are extremely helpful. Models tend to be thought of as \"black boxes\" and often there is little information about them. Because machine learning systems are becoming more pervasive and are used for decision making, using datasheets is a step towards developing more responsible machine learning systems. Datasheets for models can help improve transparency by helping system developers think deeply about the systems they implement and provide users with the information on how a model was built as well as its intended use.\n",
    "\n",
    "## Datasheets for models\n",
    "\n",
    "[Annotation and Benchmarking on Understanding and Transparency of Machine Learning Lifecycles (ABOUT ML)](https://www.partnershiponai.org/about-ml/), an initiative from the Partnership in AI (PAI), is a set of guidelines for machine learning system developers to develop, test, and implement machine learning system documentation. \n",
    "\n",
    "The current guidelines apply to models built using static data in models trained using various method such as supervised learning, unsupervised learning, and reinforcement learning. The level of disclosure depends on whether the intended audience is external or internal.\n",
    "\n",
    "Some model information you might want to document as part of a datasheet:\n",
    "\n",
    "- Intended use\n",
    "- Model architecture\n",
    "- Training data used\n",
    "- Evaluation data used\n",
    "- Training model performance metrics\n",
    "- Fairness information. See the [fairness in machine learning article](./concept-fairness-ml.md) to learn more.\n",
    "\n",
    "## Use Azure Machine Learning SDK to implement datasheets for models\n",
    "\n",
    "In the machine learning training workflow, as part of the training process, once you have a trained model, you usually register it. As part of this registration process, you can include additional information about the model. Using the `tags` parameter in the [`register`](https://docs.microsoft.com/python/api/azureml-core/azureml.core.model.model?view=azure-ml-py#profile-workspace--profile-name--models--inference-config--input-data-none--input-dataset-none--cpu-none--memory-in-gb-none--description-none-) method of the `Model` class.\n",
    "\n",
    "## Example for OpenAI GPT-2 model\n",
    "In this example we will \n",
    "- download the GPT-2 model from [https://github.com/openai/gpt-2/](https://github.com/openai/gpt-2/)\n",
    "- register it in Azure Machine Learning with tags to provide rich metadata to construct a datasheet\n",
    "- render the datasheet using native Jupyter notebook capabilities using [https://github.com/openai/gpt-2/blob/master/model_card.md](https://github.com/openai/gpt-2/blob/master/model_card.md) as inspiration\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Getting started: Clone the GPT-2 repo "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!git clone https://github.com/openai/gpt-2.git"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Download the trained GPT-2 Model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!python ./gpt-2/download_model.py 124M "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Connect to the Azure Machine Learning workspace"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from azureml.core import Model, Workspace\n",
    "from IPython.core.display import display, Markdown\n",
    "from markdown import markdown\n",
    "\n",
    "ws = Workspace.from_config()\n",
    "print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep='\\n')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Register the GPT-2 Model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Registering model gpt-2\n",
      "Name: gpt-2\n",
      "Version: 5\n"
     ]
    }
   ],
   "source": [
    "model = Model.register(workspace=ws,\n",
    "                       model_name='gpt-2',                # Name of the registered model in your workspace.\n",
    "                       model_path='./models/124M',  # Local folder\n",
    "                       model_framework=Model.Framework.PYTORCH,  # Framework used to create the model.\n",
    "                       model_framework_version='1.3',             # Version of PyTorch to create the model.\n",
    "                       description='This model was developed by researchers at OpenAI to help us understand how the capabilities of language model capabilities scale as a function of the size of the models',\n",
    "                       tags={'title': 'GPT-2 model card',\n",
    "    'datasheet_description':\n",
    "\"\"\"\n",
    "Last updated: November 2019\n",
    "\n",
    "Inspired by [Model Cards for Model Reporting (Mitchell et al.)](https://arxiv.org/abs/1810.03993), we’re providing some accompanying information about the GPT-2 family of models we're releasing.\n",
    "\n",
    "\"\"\",\n",
    "    'details': 'This model was developed by researchers at OpenAI to help us understand how the capabilities of language model capabilities scale as a function of the size of the models (by parameter count) combined with very large internet-scale datasets (WebText).',\n",
    "    'date': 'February 2019, trained on data that cuts off at the end of 2017.', \n",
    "    'type': 'Language model',\n",
    "    'version': '1.5 billion parameters: the fourth and largest GPT-2 version. We have also released 124 million, 355 million, and 774 million parameter models.',\n",
    "    'help': 'https://forms.gle/A7WBSbTY2EkKdroPA',\n",
    "    'usecase_primary': \n",
    "\"\"\"\n",
    "The primary intended users of these models are *AI researchers and practitioners*.\n",
    "\n",
    "We primarily imagine these language models will be used by researchers to better understand the behaviors, capabilities, biases, and                                 constraints of large-scale generative language models.\n",
    "\"\"\",\n",
    "    'usecase_secondary':\n",
    "\"\"\"\n",
    "Here are some secondary use cases we believe are likely:\n",
    "\n",
    "- **Writing assistance**: Grammar assistance, autocompletion (for normal prose or code)\n",
    "- **Creative writing and art**: exploring the generation of creative, fictional texts; aiding creation of poetry and other literary art.\n",
    "- **Entertainment**: Creation of games, chat bots, and amusing generations.\n",
    "\"\"\",\n",
    "    'usecase_outofscope':\n",
    "\"\"\"\n",
    "Because large-scale language models like GPT-2 do not distinguish fact from fiction, we don’t support use-cases that require the generated text to be true.\n",
    "\n",
    "Additionally, language models like GPT-2 reflect the biases inherent to the systems they were trained on, so we do not recommend that they be deployed into systems that interact with humans unless the deployers first carry out a study of biases relevant to the intended use-case. We found no statistically significant difference in gender, race, and religious bias probes between 774M and 1.5B, implying all versions of GPT-2 should be approached with similar levels of caution around use cases that are sensitive to biases around human attributes.\n",
    "\"\"\",\n",
    "    'dataset_description':\n",
    "\"\"\"\n",
    "This model was trained on (and evaluated against) WebText, a dataset consisting of the text contents of 45 million links posted by users of the ‘Reddit’ social network. WebText is made of data derived from outbound links from Reddit and does not consist of data taken directly from Reddit itself. Before generating the dataset we used a blocklist to ensure we didn’t sample from a variety of subreddits which contain sexually explicit or otherwise offensive content.\n",
    "\n",
    "To get a sense of the data that went into GPT-2, we’ve [published a list](domains.txt) of the top 1,000 domains present in WebText and their frequency.  The top 15 domains by volume in WebText are: Google, Archive, Blogspot, GitHub, NYTimes, Wordpress, Washington Post, Wikia, BBC, The Guardian, eBay, Pastebin, CNN, Yahoo!, and the Huffington Post.\n",
    "\"\"\",\n",
    "    'motivation': 'The motivation behind WebText was to create an Internet-scale, heterogeneous dataset that we could use to test large-scale language models against. WebText was (and is) intended to be primarily for research purposes rather than production purposes.',\n",
    "    'caveats':\n",
    "\"\"\"\n",
    "Because GPT-2 is an internet-scale language model, it’s currently difficult to know what disciplined testing procedures can be applied to it to fully understand its capabilities and how the data it is trained on influences its vast range of outputs. We recommend researchers investigate these aspects of the model and share their results.\n",
    "\n",
    "Additionally, as indicated in our discussion of issues relating to potential misuse of the model, it remains unclear what the long-term dynamics are of detecting outputs from these models. We conducted [in-house automated ML-based detection research](https://github.com/openai/gpt-2-output-dataset/tree/master/detector) using simple classifiers, zero shot, and fine-tuning methods. Our fine-tuned detector model reached accuracy levels of approximately 95%. However, no one detection method is a panacea; automated ML-based detection, human detection, human-machine teaming, and metadata-based detection are all methods that can be combined for more confident classification. Developing better approaches to detection today will give us greater intuitions when thinking about future models and could help us understand ahead of time if detection methods will eventually become ineffective.\n",
    "\"\"\"})\n",
    "\n",
    "print('Name:', model.name)\n",
    "print('Version:', model.version)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Define helper functions to collect custom tags and render datasheet\n",
    "Azure Machine Learning is extensible and allows custom tags to be used with models. You can choose which tags to use to meet your needs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_tag(tagname):\n",
    "    text = ''\n",
    "    try:\n",
    "        text = tags[tagname]\n",
    "    except:\n",
    "        print('Missing tag ' + tagname)\n",
    "    finally:\n",
    "        return text\n",
    "\n",
    "    return text\n",
    "\n",
    "def get_datasheet(tags):\n",
    "    \n",
    "    title = get_tag('title')\n",
    "    description = get_tag('datasheet_description')\n",
    "    details = get_tag('details')\n",
    "    date = get_tag('date')\n",
    "    modeltype = get_tag('type')\n",
    "    version = get_tag('version')\n",
    "    helpresources = get_tag('help')\n",
    "    usecase_primary = get_tag('usecase_primary')\n",
    "    usecase_secondary = get_tag('usecase_secondary')\n",
    "    usecase_outofscope = get_tag('usecase_outofscope')\n",
    "    dataset_description = get_tag('dataset_description')\n",
    "    motivation = get_tag('motivation')\n",
    "    caveats = get_tag('caveats')\n",
    "\n",
    "    datasheet = ''\n",
    "    datasheet+=markdown(f'# {title} \\n {description} \\n')\n",
    "    datasheet+=markdown(f'## Model Details \\n {details} \\n')\n",
    "    datasheet+=markdown(f'### Model date \\n {date} \\n')\n",
    "    datasheet+=markdown(f'### Model type \\n {modeltype} \\n')\n",
    "    datasheet+=markdown(f'### Model version \\n {version} \\n')\n",
    "    datasheet+=markdown(f'### Where to send questions or comments about the model \\n Please send questions or concerns using [{helpresources}]({helpresources}) \\n')\n",
    "    datasheet+=markdown('## Intended Uses:\\n')\n",
    "    datasheet+=markdown(f'### Primary use case \\n {usecase_primary} \\n')\n",
    "    datasheet+=markdown(f'### Secondary use case \\n {usecase_secondary} \\n')\n",
    "    datasheet+=markdown(f'### Out of scope \\n {usecase_outofscope} \\n')\n",
    "    datasheet+=markdown('## Evaluation Data:\\n')\n",
    "    datasheet+=markdown(f'### Datasets \\n {dataset_description} \\n')\n",
    "    datasheet+=markdown(f'### Motivation \\n {motivation} \\n')\n",
    "    datasheet+=markdown(f'### Caveats \\n {caveats} \\n')\n",
    "\n",
    "    return datasheet"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Get a model registered in the workspace\n",
    "We'll fetch the model from the model registry. In this sample we'll just grab the same model we registered above. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "model = ws.models['gpt-2']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Use built-in Jupyter renderer for the datasheet based on custom tags"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "<h1>GPT-2 model card</h1>\n",
       "<p>Last updated: November 2019</p>\n",
       "<p>Inspired by <a href=\"https://arxiv.org/abs/1810.03993\">Model Cards for Model Reporting (Mitchell et al.)</a>, we’re providing some accompanying information about the GPT-2 family of models we're releasing.</p><h2>Model Details</h2>\n",
       "<p>This model was developed by researchers at OpenAI to help us understand how the capabilities of language model capabilities scale as a function of the size of the models (by parameter count) combined with very large internet-scale datasets (WebText). </p><h3>Model date</h3>\n",
       "<p>February 2019, trained on data that cuts off at the end of 2017. </p><h3>Model type</h3>\n",
       "<p>Language model </p><h3>Model version</h3>\n",
       "<p>1.5 billion parameters: the fourth and largest GPT-2 version. We have also released 124 million, 355 million, and 774 million parameter models. </p><h3>Where to send questions or comments about the model</h3>\n",
       "<p>Please send questions or concerns using <a href=\"https://forms.gle/A7WBSbTY2EkKdroPA\">https://forms.gle/A7WBSbTY2EkKdroPA</a> </p><h2>Intended Uses:</h2><h3>Primary use case</h3>\n",
       "<p>The primary intended users of these models are <em>AI researchers and practitioners</em>.</p>\n",
       "<p>We primarily imagine these language models will be used by researchers to better understand the behaviors, capabilities, biases, and                                 constraints of large-scale generative language models.\n",
       " </p><h3>Secondary use case</h3>\n",
       "<p>Here are some secondary use cases we believe are likely:</p>\n",
       "<ul>\n",
       "<li><strong>Writing assistance</strong>: Grammar assistance, autocompletion (for normal prose or code)</li>\n",
       "<li><strong>Creative writing and art</strong>: exploring the generation of creative, fictional texts; aiding creation of poetry and other literary art.</li>\n",
       "<li><strong>Entertainment</strong>: Creation of games, chat bots, and amusing generations.\n",
       " </li>\n",
       "</ul><h3>Out of scope</h3>\n",
       "<p>Because large-scale language models like GPT-2 do not distinguish fact from fiction, we don’t support use-cases that require the generated text to be true.</p>\n",
       "<p>Additionally, language models like GPT-2 reflect the biases inherent to the systems they were trained on, so we do not recommend that they be deployed into systems that interact with humans unless the deployers first carry out a study of biases relevant to the intended use-case. We found no statistically significant difference in gender, race, and religious bias probes between 774M and 1.5B, implying all versions of GPT-2 should be approached with similar levels of caution around use cases that are sensitive to biases around human attributes.\n",
       " </p><h2>Evaluation Data:</h2><h3>Datasets</h3>\n",
       "<p>This model was trained on (and evaluated against) WebText, a dataset consisting of the text contents of 45 million links posted by users of the ‘Reddit’ social network. WebText is made of data derived from outbound links from Reddit and does not consist of data taken directly from Reddit itself. Before generating the dataset we used a blocklist to ensure we didn’t sample from a variety of subreddits which contain sexually explicit or otherwise offensive content.</p>\n",
       "<p>To get a sense of the data that went into GPT-2, we’ve <a href=\"domains.txt\">published a list</a> of the top 1,000 domains present in WebText and their frequency.  The top 15 domains by volume in WebText are: Google, Archive, Blogspot, GitHub, NYTimes, Wordpress, Washington Post, Wikia, BBC, The Guardian, eBay, Pastebin, CNN, Yahoo!, and the Huffington Post.\n",
       " </p><h3>Motivation</h3>\n",
       "<p>The motivation behind WebText was to create an Internet-scale, heterogeneous dataset that we could use to test large-scale language models against. WebText was (and is) intended to be primarily for research purposes rather than production purposes. </p><h3>Caveats</h3>\n",
       "<p>Because GPT-2 is an internet-scale language model, it’s currently difficult to know what disciplined testing procedures can be applied to it to fully understand its capabilities and how the data it is trained on influences its vast range of outputs. We recommend researchers investigate these aspects of the model and share their results.</p>\n",
       "<p>Additionally, as indicated in our discussion of issues relating to potential misuse of the model, it remains unclear what the long-term dynamics are of detecting outputs from these models. We conducted <a href=\"https://github.com/openai/gpt-2-output-dataset/tree/master/detector\">in-house automated ML-based detection research</a> using simple classifiers, zero shot, and fine-tuning methods. Our fine-tuned detector model reached accuracy levels of approximately 95%. However, no one detection method is a panacea; automated ML-based detection, human detection, human-machine teaming, and metadata-based detection are all methods that can be combined for more confident classification. Developing better approaches to detection today will give us greater intuitions when thinking about future models and could help us understand ahead of time if detection methods will eventually become ineffective.\n",
       " </p>"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "from IPython.core.display import display,Markdown\n",
    "\n",
    "tags = model.tags\n",
    "display(Markdown(get_datasheet(tags)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3.7.4 64-bit ('base': conda)",
   "language": "python",
   "name": "python37464bitbaseconda361d760225224d43a327d63ab7e3f60c"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
